fsutil: add NFS soft-mount options to prevent kernel panic on hot-unplug #149
0-danielviktorovich-0 wants to merge 1 commit
When a user physically disconnects the USB-C cable from an
anylinuxfs-managed device without running `anylinuxfs unmount` first,
the macOS NFS client (default hard-mount semantics) retries indefinitely
against the now-unreachable NFS server inside libkrun. The kernel holds
`IOMediaBSDClient` in a busy state until `watchdogd` triggers
`panic(busy timeout[1])` after 60s.
Reproduced 3 times over 8 days on Mac16,8 / M4 Pro with an identical
signature in `/Library/Logs/DiagnosticReports/panic-full-*.panic`:

    panic(cpu N): busy timeout[1], (60s):
    'IOMediaBSDClient' (1,1812001) @IOService.cpp:5986
    Panicked task ... pid <N>: watchdogd
    last started kext: com.apple.iokit.SCSITaskUserClient

The existing `deadtimeout=45` option supports Finder's manual eject
path but does not cover scheduled background I/O (Spotlight reindex,
Time Machine attempts, mds_stores, daemon polling) that hits the dead
mount after hot-unplug. macOS does not automatically tear down NFS
mounts on physical disconnect: `DiskArbitration` only fires callbacks
for registered listeners, which we do not have outside the synchronous
CLI flow (see the fsutil.rs comment near line 206 acknowledging the gap).
Soft-mount semantics with bounded timeouts return EIO after ~30s
(3 retries × 10s `timeo`) instead of holding the registry busy.
Returning EIO is appropriate when the physical device is gone —
operations that would have hung forever now produce a meaningful
error and the kernel releases the IOKit entry.
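As a rough illustration of the ~30s bound above, here is a minimal Rust sketch. The helper name `soft_mount_retry_budget` is hypothetical and not part of anylinuxfs. It assumes `timeo` is given in tenths of a second, which is how the arithmetic above reads `timeo=100` as 10s, and it ignores any retransmit backoff the kernel may apply, so the real bound may differ.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of anylinuxfs): estimates how long a soft
/// NFS mount keeps retrying before the client gives up and returns EIO.
///
/// Assumption: `timeo` is in tenths of a second and every attempt waits the
/// full `timeo`, with no backoff between retransmissions.
fn soft_mount_retry_budget(timeo_deciseconds: u32, retrans: u32) -> Duration {
    // One decisecond is 100 milliseconds.
    let per_attempt = Duration::from_millis(u64::from(timeo_deciseconds) * 100);
    per_attempt * retrans
}
```

Under these assumptions, `soft_mount_retry_budget(100, 3)` is 30 seconds, which stays below the 60s busy timeout seen in the panic signature.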
Includes a regression test, `fsutil::tests::default_nfs_opts_include_soft_mount_semantics`.
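For readers without the diff at hand, the idea behind the regression test can be sketched roughly as follows. This is an illustration only: `default_nfs_opts` and `render` are hypothetical stand-ins, and the real `NfsOptions` type in `fsutil.rs` may be shaped quite differently. Only the option names and values (`soft`, `timeo=100`, `retrans=3`, `deadtimeout=45`, `vers=3`) come from this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the real `NfsOptions::default()` in fsutil.rs.
/// Maps option name to optional value; flags such as `soft` carry no value.
fn default_nfs_opts() -> BTreeMap<&'static str, Option<&'static str>> {
    let mut opts = BTreeMap::new();
    // Defaults that already existed before this PR.
    opts.insert("vers", Some("3"));
    opts.insert("deadtimeout", Some("45"));
    // Soft-mount semantics added by this PR.
    opts.insert("soft", None);
    opts.insert("timeo", Some("100"));
    opts.insert("retrans", Some("3"));
    opts
}

/// Renders the options as a comma-separated string (keys in sorted order).
fn render(opts: &BTreeMap<&'static str, Option<&'static str>>) -> String {
    opts.iter()
        .map(|(k, v)| match v {
            Some(v) => format!("{}={}", k, v),
            None => k.to_string(),
        })
        .collect::<Vec<_>>()
        .join(",")
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Fails if any of the soft-mount options or the older defaults is
    /// removed or changed.
    #[test]
    fn default_nfs_opts_include_soft_mount_semantics() {
        let opts = default_nfs_opts();
        let expected = [
            ("soft", None),
            ("timeo", Some("100")),
            ("retrans", Some("3")),
            ("deadtimeout", Some("45")),
            ("vers", Some("3")),
        ];
        for (key, value) in expected {
            assert_eq!(opts.get(key), Some(&value), "default changed: {}", key);
        }
    }
}
```

Checking each option individually, rather than comparing one rendered string, keeps the test independent of option ordering.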
Discussed in GitHub issue (to be filed alongside this PR).
Code Review
This pull request updates the default NFS mount options for macOS in `anylinuxfs/src/fsutil.rs` to include `soft`, `timeo=100`, and `retrans=3`. These changes bound kernel-level retries when the microVM becomes unreachable, preventing potential kernel panics caused by indefinite retries. A regression test was also added to ensure these options remain in the default configuration. I have no feedback to provide.
Thank you for the pull request. Soft mount sounds reasonable. I'm going to review the change. Just note that your AI agent got some of the analysis wrong. For example the run loop in […] However, it conflates disk arbitration events which track NFS mount and the underlying disk. There is currently no tracking for the latter. Anyway, in your own words, did the change help to resolve your issue?
As for any further improvements in this direction, I would prefer not to involve LaunchAgent. There is already one process running in the background which monitors the virtual machine and the NFS eject event. It could be extended to also watch for the disk being disconnected.
Summary
When a user physically disconnects the USB-C cable from an anylinuxfs-managed device without running `anylinuxfs unmount` first, the macOS NFS client (default hard-mount semantics) retries indefinitely against the now-unreachable NFS server inside libkrun. The kernel holds `IOMediaBSDClient` in a busy state until `watchdogd` triggers `panic(busy timeout[1])` after 60s.
This PR adds `soft`, `timeo=100`, `retrans=3` to the default macOS NFS mount options so the kernel returns `EIO` after ~30s instead of hanging forever when the underlying VM/disk is gone.
Reproduction
Reproduced 3 times over 8 days on Mac16,8 / M4 Pro with macOS 26.4.1 → 26.5 (panic persisted across OS update, confirming the bug is in our integration rather than macOS itself).
The signature was identical in all three `panic-full-*.panic` files.
Reproduction steps:
1. `sudo anylinuxfs mount /dev/disk5s1 -o noatime,compress=zstd:3`
2. […] `/Volumes/<label>`
3. […] `anylinuxfs unmount`)
4. […] `mds_stores`, etc.)
Why current `deadtimeout=45` is insufficient
`deadtimeout=45` (added in `fsutil.rs:113`) helps Finder's manual eject path: Finder force-unmounts after 45s of unresponsive RPC. But it only counts toward force-unmount once Finder decides the mount is dead.
For scheduled background I/O (Spotlight, Time Machine, `mds_stores`, polling daemons), the kernel keeps retrying NFS RPCs indefinitely (hard-mount default), and `IOMediaBSDClient` stays busy, so the kernel watchdog fires at 60s before `deadtimeout` resolves anything.
Comment at `fsutil.rs:204-206`: `/// macOS relies on DiskArbitration teardown — no-op.`
This is an incorrect assumption: `DARegisterDiskDisappearedCallback` is only registered inside the synchronous `EventSession::wait_for_unmount` (`diskutil/darwin.rs:353-378`). After the CLI exits, no run loop is running, so no callback fires on unexpected disconnect.
What this PR changes
`anylinuxfs/src/fsutil.rs`: `NfsOptions::default()` on macOS now also inserts `soft`, `timeo=100`, and `retrans=3`.
Combined with the existing `deadtimeout=45`, this provides defense-in-depth against hot-unplug:
- […] `EIO`
- […] `deadtimeout=45`)
- `watchdogd`: never triggers because `IOMediaBSDClient` is released within 30s
Trade-offs
The trade-off of `soft` vs `hard` mount:
- […] `EIO` instead of indefinite hang / kernel panic
- […] `EIO`
I think this trade-off is appropriate for `anylinuxfs`'s use case: these are external removable media, not always-online network drives. Operations against a phantom mount should fail clearly.
If you'd prefer this gated behind a CLI flag (`--soft-mount`) for opt-in, happy to refactor.
Testing
- `cargo build -F freebsd` passes
- `./run-rust-tests.sh` passes (41 tests)
- Added regression test `fsutil::tests::default_nfs_opts_include_soft_mount_semantics` to lock in the new defaults; it fails if any of `soft`, `timeo=100`, `retrans=3`, or the existing `deadtimeout=45` / `vers=3` is removed
- `cargo fmt` applied
What this PR does NOT address (future work)
There's a deeper architectural issue: even with `soft` mount, the proper fix is a persistent DiskArbitration listener that automatically triggers graceful unmount on disk-disappeared events for managed disks. The existing `DARegisterDiskDisappearedCallback` machinery in `diskutil/darwin.rs` is already imported; it just needs to live outside the synchronous CLI flow.
I have a working external Python+pyobjc implementation as a local LaunchAgent that does this: `DARegisterDiskDisappearedCallback` in a persistent process triggers cleanup within ~100ms of physical disconnect. Three architectural approaches I considered for upstreaming it: (1) new long-lived daemon as LaunchAgent, (2) listener thread inside existing per-mount supervisor, (3) detect disk-removed event inside libkrun guest (vmproxy) and notify host via existing TCP control socket port 7350. Happy to submit a follow-up PR with design discussion if you're open to direction.
But this PR is intentionally scoped small: it's a low-risk, immediate-impact fix that covers the primary failure mode (hot-unplug → kernel panic via background I/O) using the existing NFS option mechanism. The structural fix can come later.
Maintainer feedback questions
Thanks for `anylinuxfs`. It's the cleanest way I've found to read btrfs on Mac without macFUSE/SIP compromises. Hoping to help make it production-grade for external removable media.